• Quick Start
  • Get to Know The Data
    • Data sources
    • Data types
  • TASK 1: Use Multi-Omics Data to Predict Drug Response
  • TASK 2: Predict Drug Response Using Panel Sequencing Data.
  • TASK 3: Improving Drug Response Predictions by Transforming the Data in a Batch-Aware Manner.
  • Summary of Data to be Used in TASK 1
  • Summary of Data to be Used in TASK 2
  • The Drug Response Data
  • References

Welcome to the capstone project of the “Computational Genomics: A Hands-on course on Machine Learning for Genomics.” You can find more information about the course on the course page. You can access the the Github repository of the course here. You can also access the Github repository of the capstone project from here.

The capstone project tasks are designed using data from a real world problem. The participants who provide the best reports for the capstone projects will be invited to co-author a manuscript with the Akalin lab. We are looking for best performing methods and in addition also looking for interesting visualization and/or data processing/correction methods that might enrich the manuscript.

The main aim of the tasks is to predict the response of 6 oncology drugs on different disease models. There are 3 main tasks as follows:

  • TASK 1: Predict drug response from multi-omics data
  • TASK 2: Predict drug response from panel sequencing data
  • TASK 3: Improving drug response predictions by transforming data in a batch-aware manner.

The following page contains the tasks and all of the necessary information to start the capstone project. Please provide documented code and figures for the problems below in PDF format. The report does not have to include the time consuming parts regarding model tuning etc.. You can run those computational experiments outside the report and simply include the best parameters in the report explaining in the text how you tuned the parameters.

In addition, you will need to fill a form about your results for us to be able to standardize the results we got.


Quick Start

All data can be accessed at here.

  • You can download the data by using Linux/Ubuntu command window as well. To download the data, copy the link of the file from the HTML file provided to you and just type wget link.

  • For example: wget https://bimsbstatic.mdc-berlin.de/akalin/AAkalin_DrugResponse/prepared/AAkalin_DrugResponse-prepared.tar.gz. Typing this command will download the entire data folder.If you’re having any issues on wget command, you can get more information from the link here.

  • Once you download the data, you need to unzip it.You can use gunzip() function to unzip the file with .gz extension. Additionally, you may also need to use R.utils::untar() function to untar the .tar files. All of the data is in .tsv format. There is no significantly difference between the .csv and .tsv format, so you can simply read .tsv by specifying sep = '\t', or by using read.delim() function.

  • For example: read.delim(R.utils::gunzip("data/CCLE/drug_response/TRAMETINIB.tsv.gz")) and you can assign the data you read to a variable.

From the code below, you can see how the PCA graph of gene expression is plotted by coloring it based on the drug.

#read the gene exp data
CCLE_gex <- read.delim("data/CCLE/all_intersect_genes/gex.tsv") %>% 
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines
# with the dimension of 1305 x 18356

#read the drug response data
drug_resp_CCLE <- read.delim("data/CCLE/drug_response/ALPELISIB.tsv") %>%
    dplyr::bind_rows(read.delim("data/CCLE/drug_response/BUPARLISIB.tsv")) %>%
    dplyr::bind_rows(read.delim("data/CCLE/drug_response/CETUXIMAB.tsv")) %>%
    dplyr::bind_rows(read.delim("data/CCLE/drug_response/ERLOTINIB.tsv")) %>%
    dplyr::bind_rows(read.delim("data/CCLE/drug_response/RIBOCICLIB.tsv")) %>%
    dplyr::bind_rows(read.delim("data/CCLE/drug_response/TRAMETINIB.tsv")) %>%
    rename(drug = column_name, cell_lines = sample_id)
#with the dimension of 4190 x 5

#Remove near zero variation 
nzv <- caret::preProcess(CCLE_gex,method="nzv",uniqueCut = 15)
CCLE_gex_pp <- predict(nzv,CCLE_gex)
#with the dimension of 3045 x 16023

#center scale and pca
preprocessParams  <- caret::preProcess(CCLE_gex_pp,method=c("pca"))
CCLE_gex_pp <- predict(preprocessParams,CCLE_gex_pp)

#combine the drug response values
ccle_data <- dplyr::inner_join(drug_resp_CCLE, CCLE_gex_pp, by = "cell_lines") %>%
  select(!data_type  & !variable_type)
#with the dimension of 3045 x 18358

#Save the plots
ALPpcaplot <- ccle_data %>% filter(drug == "ALPELISIB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("ALPELISIB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")

BUPpcaplot <- ccle_data %>% filter(drug == "BUPARLISIB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("BUPARLISIB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")
    
CETpcaplot <- ccle_data %>% filter(drug == "CETUXIMAB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("CETUXIMAB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")

ERLpcaplot <- ccle_data %>% filter(drug == "ERLOTINIB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("ERLOTINIB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")

RIBpcaplot <- ccle_data %>% filter(drug == "RIBOCICLIB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("RIBOCICLIB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")

TRApcaplot <- ccle_data %>% filter(drug == "TRAMETINIB") %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(aes(color =value))+
  ggtitle("TRAMETINIB")+
  theme(legend.position = "right")+
  scale_color_gradient(low = "orange", high = "Blue")

cowplot::plot_grid(ALPpcaplot,BUPpcaplot,CETpcaplot,ERLpcaplot,RIBpcaplot,TRApcaplot)

  • NOTE : To use mutation data, you need to make it a binary matrix. However, you may have limited memory problems with using this function. We recommend you run the code and save the data as .rds, then read the .rds file in a new session. You can do it by following the code snippet below.
#necessary libraries
library(reshape2)
library(data.table)

#read the data
CCLE_mut_pan <- read.delim("data/CCLE/oncoKB_intersect_genes/mut.tsv") %>%
  select(sample_id,gene_id,protein_change) %>% #Select the necessary columns
  as.data.table() %>% melt(id.vars = c("sample_id","gene_id"))

CCLE_mut_pan <- dcast(CCLE_mut_pan, sample_id ~ gene_id + value, length)
saveRDS(CCLE_mut_pan,"CCLE_mut_pan.rds") #save the data as a RDS file

#Now open a new session then you can read and assign the data as a data frame
ccle_mut_pan <- as.data.frame(readRDS("ccle_mut_pan.rds"))

Get to Know The Data

Data sources

  • There are four main sections in the data file, which are CCLE, PDX_ALL, GDSC_broad and GDSC_sanger. Each of these main folders contains three subfolders, as indicated in the README file as well. The all_intersect_genes folder contains all of the genomic data which are the predictive features that we will be using in our machine learning model. The drug_response folder contains the drug response data which is the continuous response variable that we will be training a model to predict. There is six different drug response data present within the drug_response folder.

  • The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines.The CCLE is an independent dataset produced by the BROAD institute. The CCLE provides public access to genomic data, visualization, and analysis for over 1100 cancer cell lines. You can get more info on the CCLE from here. Ghandi et al. (2019)

  • Genomics of Drug Sensitivity in Cancer (GDSC) is the largest public resource database for information on drug sensitivity in cancer cells and molecular markers of drug response. Data are freely available without restriction. GDSC currently contains drug sensitivity data for more than 75,000 experiments, describing the response to 138 anticancer drugs across more than 700 cancer cell lines. You can get more info on the GDSC from here. Yang et al. (2012)

  • Genomics of Drug Sensitivity in Cancer (GDSC) consists of two batches:

    1. BROAD: half of cell lines sequenced in Broad, from BROAD cell line library, which you can find in GDSC_BROAD data folder.
    2. SANGER: half of cell lines sequenced in Sanger from the SANGER cell line library which you can find in GDSC_SANGER data folder.

NOTE : Even though they might have overlapping cell lines,GDSC_BROAD and CCLE are different datasets.

  • The data within The CCLE and The GDSC datasets are based on cancer cell lines . Cancer cell lines are the tumor samples that are usually at point isolated from patients but since then grown in the lab environment. Cell lines are considered poor representations of actual tumors despite being widely utilized in research.

  • PDX, on the other hand, represents the tumor structure better than cell lines. PDXstands for Patient-Derived Xenograft. PDX is the re-creation of a tumor sample taken by biopsy from the patient by xenografting into the mice and letting it grow there. Since tumors grow in an organism, they are closer to real tumors and are more meaningful for pre-clinical studies. In Figure below you can see the overall representation for using PDX to test drug efficacy. Jung, Seol, and Chang (2017) And the cell cultivation methodologies are represented in Figure 2. Cho et al. (2016) Ravi, Ramesh, and Pattabhi (2016)

knitr::include_graphics(c("PDX.png","cellculture.jpg"))


Data types

Gene Expression

  • Cancer cells have altered gene expression. There could be many genes that are abberantly turned on or off that dramatically alter the overall activity of the cell. Gene expression data is provided for all sets in the multi-omics-related tasks.

  • NOTE that the gene expression data was transformed. (values: log10 FPKM + 0.01)


Copy Number of Variation

  • A CNV(Copy Number of Variation) occurs is when the number of copies of a particular gene or region vary from one individual to the next. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes.For more information you can click the link here.

  • You’ll notice that the CNV data within the folders consists of zeros (0), ones (1) and minus ones (-1).

    1. Genes with focal CNV values smaller than -0.3 are categorized as a “loss” (-1)
    2. Genes with focal CNV values larger than 0.3 are categorized as a “gain” (+1)
    3. Genes with focal CNV values between and including -0.3 and 0.3 are categorized as “neutral” (0).

Mutation

  • A mutation is a change in the genome sequence compared to a reference. It may or may not result in abnormal proteins and/or changes in gene expression. Cancer is thought to develop due to mutations in key genes which start a cascade of molecular changes which results in a malignant cells, which multiply uncontrollably and could invade other tissues. We provide mutation data in an harmonized manner for all datasets. We recommend binarizing the mutation data before machine learning, one way to do this is provided above in the “Quick Start Section.” Other similar methods also exists here: https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods

Drug response

  • Drug response data indicates the drug efficacy in cell lines and PDX. They are widely used in cancer research to prioritize drugs to be tested in patients. You’ll notice the metric of the drug response is AUC. In pharmacology, the area under the plot of plasma concentration of a drug versus time after dosage (called “area under the curve” or AUC) gives insight into the extent of exposure to a drug and its clearance rate from the body. AUC combines information about drug efficacy and potency into a single value, and it was reported to be the most robust metric previously. To learn more on response metrics click this link We provide AUC values for PDX dataset as well, however AUC values are derived from tumor volume change for PDX. It should be mentioned that the AUC does not measure the same things on the cell lines and PDX samples, and that the drug response distribution on PDX and Cell lines can be markedly different, even for same drugs. However, we expect/assume predictions on PDX using models trained on cell line AUCs should be correlated to PDX AUCs.

TASK 1: Use Multi-Omics Data to Predict Drug Response

This task is about using multi-omics data to predict drug response (gene expression, CNVs and mutations). The datasets to be used are in all_intersect_genes and drug_response folders in GDSC_Broad, GDSC_Sanger, CCLE and PDX_all folders. PDX_all should not be used for training, only for testing. The model parameters during training should not be optimized in anyway towards PDX prediction performance. This is extremely important. You should use all omics dataset in training not just expression.

The sub-tasks are as follows:

  1. Check if you need some sort of normalization, filtering and/or transformation. These methods have to be applied in all training and test datasets. More complicated methods for batch effect correction is on Task 3 and don’t belong in this task. Just check if normalization, transformation and filtering is needed. And apply whatever you think is necessary. You should first look at normalization and filtering per omic data type. You then can choose to concatanate filtered and normalized single omic datasets into a multi-omics matrix prior to training. You can also experiment to see how different data processing techniques affects accuracy. But those experiments do not belong in the final report, just include the final processing method and state if you did some experimentation to arrive at that method.

  2. Train machine learning models to predict drug response on each cell line dataset (GDSC_Sanger, GDSC_Broad, CCLE) using 10-fold cross-validation. You can use any machine learning model available to you in R and tune its performance. You would train a model for each drug and each cell line dataset. For those best, performaing models, make a table or heatmap (with values printed on the cells) for the CV R-squared values. columns: CCLE, GDSC_Broad, GDSC_Sanger and rows: ALPELISIB, BUPARLISIB, CETUXIMAB, ERLOTINIB, RIBOCICLIB, TRAMETINIB. Do not change this row/column order.

  3. By now you have a tuned model for each drug and cell line dataset combination. What happens when you use a model trained on one cell line dataset on on the other two? For example, for a given drug say Cetuximab, apply the CCLE model on GDSC_Sanger and GDSC_Broad datasets and record R-squared for those predictions. Then, you can take the GDSC_Broad model and apply it on other two and record R-squared values. If you keep doing this, for each drug you will have six values for the R-squared measure. Output median and mean values as a table, and make boxplots for R-squared values of each drug as obtained above. Is there a dataset that results in poor models for many drugs?

  4. Take the best model from sub-task 1 and also the combine the models models in an ensemble learner ( this could be simply averaging their output) and predict drug response on the PDX data set. Plot the metrics (R-squared, Pearson correlation and Spearman’s Correlation) for observed vs predicted efficacy of each drug. Did the simple model perform better than the ensemble model?

  • HINT : How to decide the best model from step-1 for a given drug, this could be simply looking at CV R-squared values, or pick the model which has similar CV R-squared values to the out-of-sample R-squared values on the two other datasets. Anything reasonable is OK.

  • HINT : For the ensemble model, you can simply average output or re-train an ensembl model with cross-validation. You don’t have to use all the models in the ensemble, you can choose to remove a model if you believe it performs poorly. Anything is OK but not every approach will be performant. Use your imagination.

  • IMPORTANT NOTE : the PDX data set is for testing ONLY and it shouldn’t be used for training the model or optimizing any training parameter including ensemble models.


TASK 2: Predict Drug Response Using Panel Sequencing Data.

This task is about using panel sequencing data to predict drug response (CNVs and mutations for few hundred genes). This kind of data is usually obtained as part of cancer diagnostics therefore more easily accesible for clinical samples. However, since the panel sequencing data measures limted set of features from a limited set of genes, it may not be the best choice for clinical variable modeling. The datasets should be used for this task are in oncoKB_intersect_genes and drug_response folders in GDSC_Broad, GDSC_Sanger, CCLE and PDX_all folders. PDX_all should not be used for training, only for test. The model parameters during training should not be optimized in anyway towards PDX prediction performance. This is extremely important.

The sub-tasks are as follows:

  1. Check if you need some sort of normalization, filtering and/or transformation. These methods have to be applied in all training and test datasets. More complicated methods for batch effect correction is on Task 3 and don’t belong in this task. Just check if normalization, transformation and filtering is needed. And apply whatever you think is necessary. You can also experiment to see how different data processing techniques affects accuracy. But those experiments do not belong in the final report, just include the final processing method and state if you did some experimentation to arrive at that method.

  2. Train machine learning models to predict drug response on each cell line dataset (GDSC_Sanger, GDSC_Broad, CCLE) using 10-fold cross-validation. You can use any machine learning model available to you in R and tune its performance. You would train a model for each drug and each cell line dataset. For those best, performaing models, make a table or heatmap (with values printed on the cells) for the CV R-squared values. columns: CCLE, GDSC_Broad, GDSC_Sanger and rows: ALPELISIB, BUPARLISIB, CETUXIMAB, ERLOTINIB, RIBOCICLIB, TRAMETINIB. Do not change this row/column order.

  3. By now you have a tuned model for each drug and cell line dataset combination. What happens when you use a model trained on one cell line dataset on on the other two? For example, for a given drug say Cetuximab, apply the CCLE model on GDSC_Sanger and GDSC_Broad datasets and record R-squared for those predictions. Then, you can take the GDSC_Broad model and apply it on other two and record R-squared values. If you keep doing this, for each drug you will have six values for R-squared. Output median and mean values as a table, and make boxplots for R-squared values of each drug as obtained above.

  4. Take the best model from sub-task 1 and also the combine the models models in an ensemble learner ( this could be simply averaging their output) and predict drug response on the PDX data set. Plot the metrics (R-squared, Pearson correlation and Spearman’s Correlation) for observed vs predicted efficacy of each drug. Did the simple model perform better than ensemble model?

  • HINT : How to decide the best model from step-1 for a given drug, this could be simply looking at CV R-squared values, or pick the model which has similar CV R-squared values to the out-of-sample R-squared values on the two other datasets. Anything reasonable is OK.

  • HINT : For the ensemble model, you can simply average output or re-train an ensembl model with cross-validation. You don’t have to use all the models in the ensemble, you can choose to remove a model if you believe it performs poorly. Anything is OK but not every approach will be performant. Use your imagination.

  • IMPORTANT NOTE : the PDX data set is for test ONLY and it shouldn’t be used for training the model or optimizing any training parameter including ensemble models.

  1. Compare the best performing models for PDX prediction from TASK 1 and TASK2. Make informative figures and briefly discuss (2-3 sentences) if multi-omics or panel sequencing is better for modeling response judging from the PDX results.

TASK 3: Improving Drug Response Predictions by Transforming the Data in a Batch-Aware Manner.

The PDX and the cell line datasets are biologically different. On top of this, technical differences due to data acquisition and processing exists. These differences might drive the patterns we observe more than the biology that is related to the drug response, and the differences create so-called “batches” of data where data points cluster by technical effects rather than patters of interesting biology. Ideally, we can try to remove these differences and train models on datasets where these differences are removed. In many cases, we may not be able to completely remove these effects or removing these effects could also remove interesting biological patterns. However, there is also potential to produce better models that could have higher performance on PDX models.

Keeping these in mind, the sub-tasks are as follows:

  1. Apply batch-removal methods on multi-omics datasets (TASK 1) including PDX. You can choose to work with a subset or only one of the cell line datasets but PDX dataset must be in there prior to batch effect removal. After removal of the effect, train drug response prediction onlny on the cell line dataset(s) and predict on PDX. Report if R-squared improved over TASK1 or not. You can try multiple batch removal methods. If that’s the case, report the PDX prediction performance for each of the methods. Make easy to undetstand & presentable plots.

  2. Apply batch-removal methods to panel sequnecing datasets (TASK 2) including PDX. You can choose to work with a subset or only one of the cell line datasets but PDX dataset must be in there prior to batch effect removal. After removal of the effect, train drug response prediction only on the cell line dataset(s) and predict on PDX. Report if R-squared improved over TASK 2 or not. You can try multiple batch removal methods. If that’s the case, report the PDX prediction performance for each of the methods. Make easy to undetstand & presentable plots.

You might have problems applying these methods to binarized datasets such as mutation matrices. But you can try data transformations to make them work. One easy trick to try on binary matrices is to set 0s to -1. Another one could be to try logistic regression residuals instead of linear regression residuals ( regressBatches function uses linear regression residuals). You should probably try to use these methods in each omic data type seperately and construct corrected matrices for each of them. There are methods that could work on multi-omics datasets to do matrix factorization, you can try them as well and use them in an SVD-like manner as suggested above, ex: find latent factors associated with batch and remove them.

Summary of Data to be Used in TASK 1

Below some of the necessary summary statistics about the data you will use to complete tasks. A summary of drug response data is given at the end of the document.

The CCLE Data

The dimension of the gene expression data is 1305 x 18357. There are also no missing values present within the dataset. Since the dimensions of the data was high, only the first fifthy columns were analyzed statistically.

#read the data
CCLE_gex <- read.delim("data/CCLE/all_intersect_genes/gex.tsv") %>% 
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

#Check for the normality
 hist(unlist(CCLE_gex[,2:50]),xlab="Gene Expression values",
     main="The CCLE Gene Expression ",border="white",
     col="cornflowerblue",bins = 50)

#Plot the boxplots for the Gene Expression values of the first 50 columns
 boxplot(CCLE_gex[,2:50],outline=FALSE, col="cornflowerblue",
        xlab = "Gene Expression values",xaxt="n",
        ylab = "Frequency",main = "The CCLE Gene Expression")

#check the missing values of the data set
CCLE_gex[CCLE_gex == "NA"] <- NA
table(is.na(CCLE_gex)) #FALSE:23954580  
#There are no missing values

There are no missing values present. Heatmap of the CNV data was plotted. The dimension of the CNV data is 1745 x 18356 .

#read the data
CCLE_cnv <- read.delim("data/CCLE/all_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Blues"))(12)

pheatmap::pheatmap(CCLE_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,main="Copy Number Data from the CCLE",color = cols)

#check the missing values of the data set
CCLE_cnv[CCLE_cnv == "NA"] <- NA
table(is.na(CCLE_cnv)) #FALSE:32031220   
#There are no missing values

Below, the variant classification, top 10 mutated genes, variants per sample plots were plotted.

NOTE : TheCCLE mutation data contain missing values, so you have to deal with them in order to use.

CCLE_mutdelim <- read.delim(file = "data/CCLE/all_intersect_genes/mut.tsv", 
                               sep = "\t", header = TRUE, fill = TRUE,
                            comment.char = "#") #read the mutation data 

CCLE_mutdelim$Tumor_Seq_Allele2 <- CCLE_mutdelim$Tumor_Seq_Allele1
CCLE_mut <- read.maf(maf= CCLE_mutdelim) #to convert and read it as maf file
## -Validating
## --Removed 6264 duplicated variants
## --Non MAF specific values in Variant_Classification column:
##   5UTR
##   5Flank
##   Start_Codon_Del
##   Stop_Codon_Ins
##   3UTR
## -Silent variants: 355334 
## -Summarizing
## --Possible FLAGS among top ten genes:
##   TTN
##   MUC16
##   OBSCN
##   SYNE1
##   USH2A
## -Processing clinical data
## --Missing clinical data
## -Finished in 40.0s elapsed (39.5s cpu)
#Plot the summary of mutation data 
plotmafSummary(maf = CCLE_mut, rmOutlier = TRUE, addStat = 'median',
               dashboard = TRUE,titvRaw = FALSE)

#check the missing values of the data set
CCLE_mutdelim[CCLE_mutdelim == "NA"] <- NA
table(is.na(CCLE_mutdelim)) #FALSE: 43439347 , TRUE: 886853 
#There are missing values

The GDSC BROAD Data

A histogram and a boxplot were plotted for representing distribution of the gene expression values of the GDSC BROAD dataset. There are no missing values within the dataset. Check the distribution of data. The dimensions of the gene expression data is 613 x 18356.

#read the data
BROAD_gex <- read.delim("data/GDSC_broad/all_intersect_genes/gex.tsv") %>% 
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

#Check for the normality
 hist(unlist(BROAD_gex[,2:50]),xlab="Gene Expression values",
     main="The GDSC BROAD Histogram ",border="white",
     col="Green",bins = 50)

#Plot the boxplots for the Gene Expression values of the first 50 columns
 boxplot(BROAD_gex[,2:50],outline=FALSE, col="Green",
        xlab = "Gene Expression values",xaxt="n",
        ylab = "Frequency",main = "The GDSC BROAD BoxPlot")

#check the missing values of the data set
BROAD_gex[BROAD_gex == "NA"] <- NA
table(is.na(BROAD_gex)) #FALSE:11252228   
#There are no missing values

heatmap of the CNV data was plotted. There are no missing values present within the dataset.The dimensions of the CNV data is 891 x 18356.

#read the data
BROAD_cnv <- read.delim("data/GDSC_broad/all_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Greens"))(12)

pheatmap::pheatmap(BROAD_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
                   main="Copy Number Data from the GDSC BROAD",col = cols)

#check the missing values of the data set
BROAD_cnv[BROAD_cnv == "NA"] <- NA
table(is.na(BROAD_cnv)) #FALSE:11252228   
#There are no missing values

The frequency of each variant type and, top 10 mutated genes were plotted.

NOTE : There are missing values present within the data.

BROAD_mutdelim <- data.table::fread(
  file = "data/GDSC_broad/all_intersect_genes/mut.tsv",sep = "\t", 
  header = TRUE, fill = TRUE) #read the mutation data

genesum <- maftools::getGeneSummary(MAF(BROAD_mutdelim)) %>% group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>%
  arrange() %>% head(10)
## -Processing clinical data
## --Missing clinical data
Variant_Type <- BROAD_mutdelim %>% group_by(Variant_Type) %>%
count() %>% rename(frequency = n)

var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, y = frequency, fill=Variant_Type))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Variant Type") +
  ylab("Frequency") +
  xlab("Variant Type")

ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill = AlteredSamples))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
  ylab("Number of Altered Samples") +
  xlab("Gene Names")+
  theme(legend.position = "right")

ggplotly(var_plot)
03000006000009000001200000DELINSSNP
DELINSSNPVariant TypeFrequencyVariant TypeVariant_Type
#check the missing values of the data set
BROAD_mutdelim[BROAD_mutdelim == "NA"] <- NA
table(is.na(BROAD_mutdelim)) #FALSE: 22905359  , TRUE: 24115732  
#There are missing values

The GDSC SANGER Data

A histogram and a boxplot were plotted to represent the expression values of the first fifty genes. There are no missing values present within the dataset. Check the distribution of the gene expression values. The dimensions of the gene expression data is 384 x 18356.

#read the data
SANGER_gex <- read.delim("data/GDSC_sanger/all_intersect_genes/gex.tsv") %>% 
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

#Check for the normality
 hist(unlist(SANGER_gex[,2:50]),xlab="Gene Expression values",
     main="The GDSC SANGER Histogram ",border="white",
     col="Orange",bins = 50)

#Plot the boxplots for the Gene Expression values of the first 50 columns
 boxplot(SANGER_gex[,2:50],outline=FALSE, col="Orange",
        xlab = "Gene Expression values",xaxt="n",
        ylab = "Frequency",main = "The GDSC SANGER BoxPlot")

#check the missing values of the data set
SANGER_gex[SANGER_gex == "NA"] <- NA
table(is.na(SANGER_gex)) #FALSE:7048704   
#There are no missing values

Heatmap of the CNV data was plotted. There are no missing values present within the dataset. The dimensions of the CNV data is 891 x 18356.

#read the data
SANGER_cnv <- read.delim("data/GDSC_sanger/all_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Oranges"))(12)

pheatmap::pheatmap(SANGER_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
                     main="Copy Number Data from the GDSC SANGER", col=cols)

#check the missing values of the data set
SANGER_cnv[SANGER_cnv == "NA"] <- NA
table(is.na(SANGER_cnv)) #FALSE:16355196    
#There are no missing values

The frequency of each variant type and, top 10 mutated genes were plotted.

NOTE : There are missing values present in the mutation data.

SANGER_mutdelim <-data.table::fread(
  file = "data/GDSC_sanger/all_intersect_genes/mut.tsv",sep = "\t", 
  header = TRUE, fill = TRUE)#read the mutation data 

genesum <- maftools::getGeneSummary(MAF(SANGER_mutdelim)) %>%
  group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>% arrange() %>%
  head(10)
## -Processing clinical data
## --Missing clinical data
Variant_Type <- SANGER_mutdelim %>% group_by(Variant_Type) %>% count() %>%
  rename(frequency = n)

var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, 
                                          y = frequency, fill=Variant_Type))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Variant Type") +
  ylab("Frequency") +
  xlab("Variant Type")

ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill = AlteredSamples))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
  ylab("Number of Altered Samples") +
  xlab("Gene Names")+
  theme(legend.position = "right")

ggplotly(var_plot)
03000006000009000001200000DELINSSNP
DELINSSNPVariant TypeFrequencyVariant TypeVariant_Type
#check the missing values of the data set
SANGER_mutdelim[SANGER_mutdelim == "NA"] <- NA
table(is.na(SANGER_mutdelim)) #FALSE: 24111028   , TRUE: 24115732   
#There are missing values

The PDX Data

There are no missing values present within the PDX_all data set. Check the distribution of gene expression. How does it differ from the distribution of gene expression values in other datasets? The dimension of the gene expression data is 399 x 18356.

#read the data
PDX_gex <- read.delim("data/PDX_all/all_intersect_genes/gex.tsv") %>% 
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

#Check for the normality
 hist(unlist(PDX_gex[,2:50]),xlab="The PDX Dataset",
     main="The PDX Gene Expression ",border="white",
     col="Purple")

#Plot the boxplots for the Gene Expression values of the first 50 column
PDX_gex[,2:50] %>% boxplot(names = FALSE,outline=FALSE, col="Purple",
                           xlab = "Gene Expression values",ylim = c(0,400),
                           ylab = "Frequency",main = "The PDX Gene Expression")

#check the missing values of the data set
PDX_gex[PDX_gex == "NA"] <- NA
table(is.na(PDX_gex)) #FALSE:23954580   
#There are no missing values

There are no missing values present. Heatmap of the CNV data was plotted. The dimension of the CNV data is 399 x 18356.

#read the data
PDX_cnv <- read.delim("data/PDX_all/all_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Purples"))(12)
pheatmap::pheatmap(PDX_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
                     main="Copy Number Data from the PDX",color = cols)

#check the missing values of the data set
PDX_cnv[PDX_cnv == "NA"] <- NA
table(is.na(PDX_cnv)) #FALSE:16355196     
#There are no missing values

There are missing values present in the mutation data. The histogram of the mutation frequency values were plotted.

PDX_mutdelim <- read.delim(file = "data/PDX_all/all_intersect_genes/mut.tsv", 
                               sep = "\t", header = TRUE, fill = TRUE,
                            comment.char = "#") #read the mutation data 

hist(unlist(PDX_mutdelim[,84]),xlab="Mutation Values",
     main="The PDX Mutation Frequency",border="white",
     col="Purple",bins = 10)

#check the missing values of the data set
PDX_mutdelim[PDX_mutdelim == "NA"] <- NA
table(is.na(PDX_mutdelim)) #FALSE: 806164 , TRUE: 8432975    
#There are missing values

Summary of Data to be Used in TASK 2

Below are all the necessary summary statistics about the data you will use to complete tasks. A summary of drug response data is given at the end of the document.

The CCLE Data

There are no missing values present. Heatmap of the CNV data was plotted. The dimension of the CNV data is 1745 x 883 .

#read the data
CCLE_cnv_pan <- read.delim("data/CCLE/oncoKB_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Blues"))(12)

pheatmap::pheatmap(CCLE_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,main="Copy Number Data from the CCLE",color = cols)

#check the missing values of the data set
CCLE_cnv_pan[CCLE_cnv_pan == "NA"] <- NA
table(is.na(CCLE_cnv_pan)) #FALSE:1540835    
#There are no missing values

Below, the variant classification, top 10 mutated genes, variants per sample plots were plotted.

NOTE : TheCCLE mutation data contain missing values, so you have to deal with them in order to use.

CCLE_mutdelim_pan <- read.delim(file = "data/CCLE/oncoKB_intersect_genes/mut.tsv",
                               sep = "\t", header = TRUE, fill = TRUE,
                            comment.char = "#") #read the mutation data 

CCLE_mutdelim_pan$Tumor_Seq_Allele2 <- CCLE_mutdelim_pan$Tumor_Seq_Allele1
CCLE_mut_pan <- read.maf(maf= CCLE_mutdelim_pan) #to convert and read it as maf file
## -Validating
## --Removed 522 duplicated variants
## --Non MAF specific values in Variant_Classification column:
##   Stop_Codon_Ins
##   3UTR
##   Start_Codon_Del
##   5Flank
## -Silent variants: 31994 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 2.250s elapsed (2.000s cpu)
#Plot the summary of mutation data 
plotmafSummary(maf = CCLE_mut_pan, rmOutlier = TRUE, addStat = 'median',
               dashboard = TRUE,titvRaw = FALSE)

#check the missing values of the data set
CCLE_mutdelim_pan[CCLE_mutdelim_pan == "NA"] <- NA
table(is.na(CCLE_mutdelim_pan)) #FALSE: 3719170 , TRUE: 77150  
## 
##   FALSE    TRUE 
## 3719170   77150
#There are missing values

The GDSC BROAD Data

Heatmap of the CNV data was plotted. There are no missing values present within the dataset.The dimensions of the CNV data is 891 x 18356.

#read the data
BROAD_cnv_pan <- read.delim("data/GDSC_broad/oncoKB_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Greens"))(12)

pheatmap::pheatmap(BROAD_cnv_pan[,2:50],cluster_rows=FALSE,
                   show_rownames = FALSE,
                   main="Copy Number Data from the GDSC BROAD",col = cols)

#check the missing values of the data set
BROAD_cnv_pan[BROAD_cnv_pan == "NA"] <- NA
table(is.na(BROAD_cnv_pan)) #FALSE:786753    
#There are no missing values

The frequency of each variant type and, top 10 mutated genes were plotted.

NOTE : There are missing values present within the data.

BROAD_mutdelim_pan <- data.table::fread(
  file = "data/GDSC_broad/oncoKB_intersect_genes/mut.tsv",sep = "\t", 
  header = TRUE, fill = TRUE) #read the mutation data

genesum <- maftools::getGeneSummary(MAF(BROAD_mutdelim_pan)) %>% group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>%
  arrange() %>% head(10)
## -Processing clinical data
## --Missing clinical data
Variant_Type <- BROAD_mutdelim_pan %>% group_by(Variant_Type) %>%
count() %>% rename(frequency = n)

var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, y = frequency, fill=Variant_Type))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Variant Type") +
  ylab("Frequency") +
  xlab("Variant Type")

ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill = AlteredSamples))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
  ylab("Number of Altered Samples") +
  xlab("Gene Names")+
  theme(legend.position = "right")

 ggplotly(var_plot)
0300006000090000DELINSSNP
DELINSSNPVariant TypeFrequencyVariant TypeVariant_Type
#check the missing values of the data set
BROAD_mutdelim_pan[BROAD_mutdelim_pan == "NA"] <- NA
table(is.na(BROAD_mutdelim_pan)) #FALSE: 2143308   , TRUE: 2256126  
#There are missing values

The GDSC SANGER Data

Heatmap of the CNV data was plotted. There are no missing values present within the dataset. The dimensions of the CNV data is 891 x 883.

#read the data
SANGER_cnv_pan <- read.delim("data/GDSC_sanger/oncoKB_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Oranges"))(12)

pheatmap::pheatmap(SANGER_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
                     main="Copy Number Data from the GDSC SANGER", col=cols)

#check the missing values of the data set
SANGER_cnv_pan[SANGER_cnv_pan == "NA"] <- NA
table(is.na(SANGER_cnv_pan)) #FALSE:786753     
#There are no missing values

The frequency of each variant type and, top 10 mutated genes were plotted.

NOTE : There are missing values present in the mutation data.

SANGER_mutdelim_pan <-data.table::fread(
  file = "data/GDSC_sanger/oncoKB_intersect_genes/mut.tsv",sep = "\t", 
  header = TRUE, fill = TRUE)#read the mutation data 

genesum <- maftools::getGeneSummary(MAF(SANGER_mutdelim_pan)) %>%
  group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>% arrange() %>%
  head(10)
## -Processing clinical data
## --Missing clinical data
Variant_Type <- SANGER_mutdelim_pan %>% group_by(Variant_Type) %>% count() %>%
  rename(frequency = n)

var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, 
                                          y = frequency, fill=Variant_Type))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Variant Type") +
  ylab("Frequency") +
  xlab("Variant Type")

ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill = AlteredSamples))+
  geom_bar(stat="identity") +
  coord_flip() +
  ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
  ylab("Number of Altered Samples") +
  xlab("Gene Names")+
  theme(legend.position = "right")

 ggplotly(var_plot)
0300006000090000DELINSSNP
DELINSSNPVariant TypeFrequencyVariant TypeVariant_Type
#check the missing values of the data set
SANGER_mutdelim_pan[SANGER_mutdelim_pan == "NA"] <- NA
table(is.na(SANGER_mutdelim_pan)) #FALSE: 2143308 , TRUE: 2256126    
#There are missing values

The PDX Data

There are no missing values present. Heatmap of the CNV data was plotted. The dimension of the CNV data is 399 x 883.

#read the data
PDX_cnv_pan <- read.delim("data/PDX_all/oncoKB_intersect_genes/cnv.tsv") %>%
  rename(cell_lines = X) #Change the name of cell lines column to cell_lines

cols <- colorRampPalette(brewer.pal(6,name="Purples"))(12)
pheatmap::pheatmap(PDX_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
                     main="Copy Number Data from the PDX",color = cols)

#check the missing values of the data set
PDX_cnv_pan[PDX_cnv_pan == "NA"] <- NA
table(is.na(PDX_cnv_pan)) #FALSE:352317      
#There are no missing values

The histogram of the mutation frequency values were plotted.

PDX_mutdelim_pan <- read.delim(file = "data/PDX_all/oncoKB_intersect_genes/mut.tsv", 
                               sep = "\t", header = TRUE, fill = TRUE,
                            comment.char = "#") #read the mutation data 

hist(unlist(PDX_mutdelim_pan[,84]),xlab="Mutation Values",
     main="The PDX Mutation Frequency",border="white",
     col="Purple",bins = 10)

#check the missing values of the data set
PDX_mutdelim_pan[PDX_mutdelim_pan == "NA"] <- NA
table(is.na(PDX_mutdelim_pan)) #FALSE: 75781 , TRUE: 787536     
#There are missing values

The Drug Response Data

As mentioned in the introduction section, there are different drug response data present for six different drugs within the drug response folder. These are the six common drugs that are used for the treatment of several cancer types.

The distribution of each drug was plotted. Check the difference in drug response distribution for each drug.

NOTE : TheCCLE drug response data does not contain any missing values.

# Read and Combine all of the response data 
drug_resp_CCLE <- read.delim("data/CCLE/drug_response/ALPELISIB.tsv") %>%
  dplyr::bind_rows(read.delim("data/CCLE/drug_response/BUPARLISIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/CCLE/drug_response/CETUXIMAB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/CCLE/drug_response/ERLOTINIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/CCLE/drug_response/RIBOCICLIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/CCLE/drug_response/TRAMETINIB.tsv")) %>%
  rename(drug = column_name)

table(drug_resp_CCLE$drug) #Print how many response data present 
## 
##  ALPELISIB BUPARLISIB  CETUXIMAB  ERLOTINIB RIBOCICLIB TRAMETINIB 
##        785        736        855        839         45        930
drug_plot <- drug_resp_CCLE %>% group_by(drug) %>% 
  ggplot(aes(x = value,fill=drug)) +
  geom_histogram(color="black",bins = 50) +
  facet_wrap(drug ~ ., scales = 'free_x')+
  scale_y_sqrt()+
  ggtitle("Distribution of the Response Value Among the Drugs :the CCLE")+
  xlab("Response Values")+
  ylab("Frequency")

ggplotly(drug_plot)
0.60.70.80.91.01002003000.50.60.70.80.90.250.500.751.000.40.60.81.01002003000.920.940.960.980.000.250.500.751.00
ALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBDistribution of the Response Value Among the Drugs :the CCLEResponse ValuesFrequencyALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBdrug
drug_resp_CCLE[drug_resp_CCLE == "NA"] <-  NA 
#Introduce missing values
table(drug_resp_CCLE$drug, is.na(drug_resp_CCLE$value)) 
# check if there are missing values

Histograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s drug response values?

NOTE : There are no missing values present within the GDSC BROAD dataset.

# Read and Combine all of the response data 
drug_resp_BROAD <- read.delim("data/GDSC_broad/drug_response/ALPELISIB.tsv") %>%
  dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/BUPARLISIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/CETUXIMAB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/ERLOTINIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/RIBOCICLIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/TRAMETINIB.tsv"))%>%
  rename(drug = column_name)

table(drug_resp_BROAD$drug) #Print how many response data present 
## 
##  ALPELISIB BUPARLISIB  CETUXIMAB  ERLOTINIB RIBOCICLIB TRAMETINIB 
##        761        712        806       1072         43       1550
drug_plot <- drug_resp_BROAD %>% group_by(drug) %>% 
  ggplot(aes(x = value , fill=drug)) +
  geom_histogram(color="black",bins = 50) +
  facet_wrap(drug ~ ., scales = 'free_x')+
  scale_y_sqrt()+
  ggtitle("Distribution of the Response Values: the GDSC BROAD")+
  xlab("Response Values")+
  ylab("Frequency")

ggplotly(drug_plot)
0.60.70.80.91.01002003000.50.60.70.80.90.250.500.751.000.50.70.91002003000.940.960.980.000.250.500.751.00
ALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBDistribution of the Response Values: the GDSC BROADResponse ValuesFrequencyALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBdrug
drug_resp_BROAD[drug_resp_BROAD == "NA"] <-  NA 
#Introduce missing values
table(drug_resp_BROAD$drug, is.na(drug_resp_BROAD$value)) 
# check if there are missing values

Histograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s and BROAD’S drug response values? In addition, the number of samples for each drug was printed in the table.

NOTE : There are no missing values present within the GDSC SANGER dataset.

# Read and Combine all of the response data 
drug_resp_sanger <- read.delim("data/GDSC_sanger/drug_response/ALPELISIB.tsv")%>%
  dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/BUPARLISIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/CETUXIMAB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/ERLOTINIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/RIBOCICLIB.tsv"))%>%
  dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/TRAMETINIB.tsv"))%>%
  rename(drug = column_name)

table(drug_resp_sanger$drug) #Print how many response data present 
## 
##  ALPELISIB BUPARLISIB  CETUXIMAB  ERLOTINIB RIBOCICLIB TRAMETINIB 
##        761        712        806       1072         43       1550
drug_plot <- drug_resp_sanger %>% group_by(drug) %>% 
  ggplot(aes(x = value , fill=drug)) +
  geom_histogram(color="black",bins = 50) +
  facet_wrap(drug ~ ., scales = 'free_x')+
  scale_y_sqrt()+
  ggtitle("Distribution of the Response Value Among the Drugs :GDSC SANGER")+
  xlab("Response Values")+
  ylab("Frequency")

ggplotly(drug_plot)
0.60.70.80.91.01002003000.50.60.70.80.90.250.500.751.000.50.70.91002003000.940.960.980.000.250.500.751.00
ALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBDistribution of the Response Value Among the Drugs :GDSC SANGERResponse ValuesFrequencyALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBdrug
drug_resp_sanger[drug_resp_sanger == "NA"] <-  NA 
#Introduce missing values
table(drug_resp_sanger$drug, is.na(drug_resp_sanger$value)) 
# check if there are missing values

Histograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s, SANGER’s and BROAD’S drug response values? In addition, the number of samples for each drug was printed in the table.

# Read and Combine all of the response data 
drug_resp_PDX <- read.delim("data/PDX_all/drug_response/ALPELISIB.tsv") %>%
  dplyr::bind_rows(read.delim("data/PDX_all/drug_response/BUPARLISIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/PDX_all/drug_response/CETUXIMAB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/PDX_all/drug_response/ERLOTINIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/PDX_all/drug_response/RIBOCICLIB.tsv")) %>%
  dplyr::bind_rows(read.delim("data/PDX_all/drug_response/TRAMETINIB.tsv")) %>%
  rename(drug = column_name)

table(drug_resp_PDX$drug) #Print how many response data present 
## 
##  ALPELISIB BUPARLISIB  CETUXIMAB  ERLOTINIB RIBOCICLIB TRAMETINIB 
##        281        281        281        281        281        281
drug_plot <- drug_resp_PDX %>% group_by(drug) %>%
  ggplot(aes(x = value , fill=drug)) +
  geom_histogram(color="black",bins = 50) +
  facet_wrap(drug ~ ., scales = 'free_x')+
  scale_y_sqrt()+
  ggtitle("Distribution of the Response Value Among the Drugs :the PDX")+
  xlab("Response Values")+
  ylab("Frequency")

ggplotly(drug_plot)
0.000.250.500.751.001020300.000.250.500.751.000.000.250.500.751.000.000.250.500.751.001020300.000.250.500.751.000.000.250.500.751.00
ALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBDistribution of the Response Value Among the Drugs :the PDXResponse ValuesFrequencyALPELISIBBUPARLISIBCETUXIMABERLOTINIBRIBOCICLIBTRAMETINIBdrug
  • NOTE : There are missing values present within the drug response data set. In the table you can see the number of missing values for each drugs.
drug_resp_PDX[drug_resp_PDX == "NA"] <-  NA 
#Introduce missing values
table(drug_resp_PDX$drug, is.na(drug_resp_PDX$value)) 
##             
##              FALSE TRUE
##   ALPELISIB    212   69
##   BUPARLISIB   224   57
##   CETUXIMAB     72  209
##   ERLOTINIB     29  252
##   RIBOCICLIB   246   35
##   TRAMETINIB    39  242
# check if there are missing values

References

Cho, Sung-Yup, Wonyoung Kang, Jee Yun Han, Jee Kwon, Charles Lee, and Hansoo Park. 2016. “An Integrative Approach to Precision Cancer Medicine Using Patient-Derived Xenografts.” Molecules and Cells 39 (February). https://doi.org/10.14348/molcells.2016.2350.
Ghandi, Mahmoud, Franklin Huang, Judit Jané-Valbuena, Gregory Kryukov, Christopher Lo, E. McDonald, Jordi Barretina, et al. 2019. “Next-Generation Characterization of the Cancer Cell Line Encyclopedia.” Nature 569 (May): 1–6. https://doi.org/10.1038/s41586-019-1186-3.
Jung, Jaeyun, Hyang Seol, and Suhwan Chang. 2017. “The Generation and Application of Patient Derived Xenograft (PDX) Model for Cancer Research.” Cancer Research and Treatment 50 (September). https://doi.org/10.4143/crt.2017.307.
Ravi, Maddaly, Aarthi Ramesh, and Aishwarya Pattabhi. 2016. “Contributions of 3d Cell Cultures for Cancer Research.” Journal of Cellular Physiology 232 (October). https://doi.org/10.1002/jcp.25664.
Yang, Wanjuan, Jorge Soares, Patricia Greninger, Elena Edelman, Howard Lightfoot, Simon Forbes, Nidhi Bindal, et al. 2012. “Abstract 2206: Genomics of Drug Sensitivity in Cancer (GDSC): A Resource for Therapeutic Biomarker Discovery in Cancer Cells.” Nucleic Acids Research 41 (November). https://doi.org/10.1093/nar/gks1111.